After completing this session, students should be able to:
Practice the global Needleman-Wunsch algorithm in their notebooks
Align sequences
Download R
Download and align SARS-CoV-2 sequences from the State of Bahia, Brazil
ALGORITHM
We could have used code to fill in this matrix
For the Traceback step, we follow the pointers (the arrows)
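The matrix fill and the traceback can also be written as code. Below is a minimal base R sketch, not a reference implementation. It assumes an illustrative scoring scheme (match = 1, mismatch = -1, gap = -1) that may differ from the one used in class. The function name needleman_wunsch and the pointer labels "D", "U" and "L" are hypothetical names chosen for this example.

```r
# Global alignment (Needleman-Wunsch) sketch in base R.
# Assumed scores: match = 1, mismatch = -1, gap = -1 (illustrative only).
needleman_wunsch <- function(a, b, match = 1, mismatch = -1, gap = -1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # Score matrix with one extra row and column for leading gaps.
  score <- matrix(0, nrow = n + 1, ncol = m + 1)
  # Pointer matrix (the arrows): "D" = diagonal, "U" = up, "L" = left.
  pointer <- matrix("", nrow = n + 1, ncol = m + 1)

  # Initialization: first column and first row hold cumulative gap penalties.
  score[, 1] <- (0:n) * gap
  score[1, ] <- (0:m) * gap
  pointer[-1, 1] <- "U"
  pointer[1, -1] <- "L"

  # Fill step: each cell takes the best of the three neighbouring moves.
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) match else mismatch
      choices <- c(D = score[i, j] + s,
                   U = score[i, j + 1] + gap,
                   L = score[i + 1, j] + gap)
      best <- which.max(choices)  # ties resolve to the first maximum
      score[i + 1, j + 1] <- choices[[best]]
      pointer[i + 1, j + 1] <- names(choices)[best]
    }
  }

  # Traceback step: follow the pointers from the bottom-right corner.
  i <- n + 1
  j <- m + 1
  top <- character(0)
  bottom <- character(0)
  while (i > 1 || j > 1) {
    move <- pointer[i, j]
    if (move == "D") {
      top <- c(x[i - 1], top)
      bottom <- c(y[j - 1], bottom)
      i <- i - 1
      j <- j - 1
    } else if (move == "U") {
      top <- c(x[i - 1], top)
      bottom <- c("-", bottom)
      i <- i - 1
    } else {
      top <- c("-", top)
      bottom <- c(y[j - 1], bottom)
      j <- j - 1
    }
  }

  list(score = score[n + 1, m + 1],
       aligned_a = paste(top, collapse = ""),
       aligned_b = paste(bottom, collapse = ""))
}

res <- needleman_wunsch("ACGT", "AGT")
print(res$score)      # 2
print(res$aligned_a)  # "ACGT"
print(res$aligned_b)  # "A-GT"
```

When two moves tie, this sketch keeps the first one it finds, so it returns a single optimal alignment even when several exist.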
BLAST
Fragment of a SARS-CoV-2 sequence to BLAST:
An algorithm can also be called a pipeline.
The first step to develop an algorithm is to objectively explain how to answer a question or solve a problem.
A variant calling algorithm identifies variants, or mutations, in the genome of an organism.
It is also called a SNP calling pipeline.
Here is the pseudocode of a SNP calling pipeline:
Align to reference sequence (FASTA)
Compare alignment to reference (SAM)
Annotate differences (mutations) (VCF)
Extract mutations from VCF using script
Construct a SNP Frequency Table
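The last two steps of the pseudocode (extracting mutations from VCF and building the frequency table) can be sketched in base R. This is a minimal illustration under stated assumptions, not the script used in the course. The VCF records below are hand-written, made-up examples rather than real data. The names vcf_by_sample, extract_mutations and snp_table are hypothetical, and each sample is assumed to carry a given mutation at most once.

```r
# Made-up VCF text for three samples (illustrative only, not real data).
header <- c("##fileformat=VCFv4.2",
            "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO")
vcf_by_sample <- list(
  sample_A = c(header,
               "ref\t100\t.\tC\tT\t.\tPASS\t.",
               "ref\t250\t.\tA\tG\t.\tPASS\t."),
  sample_B = c(header,
               "ref\t100\t.\tC\tT\t.\tPASS\t."),
  sample_C = c(header,
               "ref\t100\t.\tC\tT\t.\tPASS\t.",
               "ref\t250\t.\tA\tG\t.\tPASS\t.",
               "ref\t400\t.\tG\tA\t.\tPASS\t.")
)

# Extract mutations from the VCF lines of one sample as labels like "C100T".
extract_mutations <- function(vcf_lines) {
  # Meta lines ("##") and the column header ("#CHROM") both start with "#".
  records <- vcf_lines[!startsWith(vcf_lines, "#")]
  if (length(records) == 0) return(character(0))
  fields <- strsplit(records, "\t", fixed = TRUE)
  # VCF columns: 1 = CHROM, 2 = POS, 4 = REF, 5 = ALT.
  vapply(fields, function(f) paste0(f[4], f[2], f[5]), character(1))
}

# Construct the SNP frequency table across samples.
mutations <- lapply(vcf_by_sample, extract_mutations)
counts <- table(unlist(mutations))
snp_table <- data.frame(
  mutation = names(counts),
  n_samples = as.integer(counts),
  frequency = as.integer(counts) / length(vcf_by_sample),
  stringsAsFactors = FALSE
)
# Most frequent mutations first.
snp_table <- snp_table[order(-snp_table$n_samples, snp_table$mutation), ]
print(snp_table)
```

In a real pipeline the VCF lines would come from files written by the variant caller, for example via readLines(), instead of being typed inline.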
Schematic of a SNP call pipeline
The blue boxes indicate the analysis being performed.
The text above the boxes indicates the software used for each analysis. Figure is from r-charts (n.d.).
Software development considers the analytical steps in human language
Then, the software product considers the steps the machine will execute
How are the files produced, and what are the processing steps?
Where in the computational infrastructure are the files stored?
We can develop our own computational methods to understand biology and propose solutions
In order to do that we need to follow these three steps for developing a computational algorithm that will solve a problem:
An alignment across multiple sequences is much more significant than an alignment of just two sequences
The score is higher when multiple sequences align
The similarities refer to functional equivalence and evolutionary relationships between the two proteins
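One common way to score a multiple alignment is the sum-of-pairs score, which adds up the pairwise score of every pair of sequences in the alignment. The base R sketch below is an assumption about how such a score could be computed, not necessarily the scoring used in this course. The function name sum_of_pairs and the scores (match = 1, mismatch = -1, gap = -1) are illustrative choices, and a column where both sequences have a gap is assumed to contribute 0.

```r
# Sum-of-pairs score for an alignment given as equal-length strings.
# Assumed scores: match = 1, mismatch = -1, gap = -1; gap-gap pairs score 0.
sum_of_pairs <- function(alignment, match = 1, mismatch = -1, gap = -1) {
  stopifnot(length(alignment) >= 2,
            length(unique(nchar(alignment))) == 1)
  # One row per sequence, one column per alignment position.
  cols <- do.call(rbind, strsplit(alignment, ""))
  pairs <- combn(nrow(cols), 2)  # every pair of sequences
  total <- 0
  for (k in seq_len(ncol(pairs))) {
    a <- cols[pairs[1, k], ]
    b <- cols[pairs[2, k], ]
    both_gap <- a == "-" & b == "-"
    one_gap <- xor(a == "-", b == "-")
    same <- a == b & !both_gap
    differ <- a != b & !one_gap
    total <- total +
      sum(same) * match + sum(differ) * mismatch + sum(one_gap) * gap
  }
  total
}

print(sum_of_pairs(c("ACGT", "ACGT")))          # 4
print(sum_of_pairs(c("ACGT", "ACGT", "ACGT")))  # 12
print(sum_of_pairs(c("ACGT", "A-GT", "ACGA")))  # 4
```

Under this scheme, three identical sequences score higher than two because every additional agreeing sequence adds more matching pairs.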
library(Biostrings)  # provides readDNAStringSet()
dna_sarsCov2_start_30000 <- readDNAStringSet(filepath="~/Desktop/Gepoliano/UFOB/sequencesfilogeniaData/bahia_combined_30000_muscle.fasta")
Classifying proteins with Markov Chains
How can these files be accessed?
What information do the files contain?
The present program is about how a scientific question is answered, not what the final answer is
If we do not address how the question is answered, we lose the information embedded in the process of data analysis
This is an important notion to have when developing computational tools that answer a scientific question